MBPP+

Pairwise wins
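The notes do not record how the pairwise wins were computed. A minimal sketch of one common scheme, assuming per-task pass/fail results per model (the `results` layout and function name are hypothetical): model A scores a win over model B on each task where A passes and B fails; ties (both pass or both fail) are ignored.

```python
from itertools import combinations

def pairwise_wins(results):
    """Count pairwise wins between models.

    results maps model name -> list of per-task pass booleans,
    with the same task order for every model. A win for A over B
    is a task where A passes and B fails; ties are ignored.
    """
    wins = {}
    for a, b in combinations(results, 2):
        pairs = list(zip(results[a], results[b]))
        wins[(a, b)] = sum(pa and not pb for pa, pb in pairs)
        wins[(b, a)] = sum(pb and not pa for pa, pb in pairs)
    return wins

# toy example: model_a uniquely solves tasks 1 and 3, model_b none
demo = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, False],
}
print(pairwise_wins(demo))
# {('model_a', 'model_b'): 2, ('model_b', 'model_a'): 0}
```

Only the discordant tasks carry information about relative strength, which is also what a paired significance test would operate on.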

P-values
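The p-value procedure is not stated either. One standard choice for paired pass/fail outcomes is an exact two-sided sign test on the discordant tasks (equivalent to a binomial test with p = 0.5); a stdlib-only sketch, with the function name my own:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test on discordant pairs.

    wins_a / wins_b: counts of tasks where only model A
    (resp. only model B) passed. Under H0, each discordant
    task is a fair coin flip between the two models.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # two-sided tail: P(split at least as extreme as observed)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** (n - 1)
    return min(1.0, p)

# e.g. A uniquely solves 15 tasks, B uniquely solves 5
print(round(sign_test_p(15, 5), 4))  # 0.0414
```

With 20 discordant tasks, a 15/5 split is just significant at the 5% level, which illustrates why close pass@1 scores on a few hundred tasks often fail to separate models.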

Result table

rank  model                                   pass@1  win rate       Elo   (ranked by win rate)
   1  gpt-4-1106-preview                       0.733     0.795  1285.135
   2  meta-llama-3-70b-instruct                0.690     0.751  1244.763
   3  opencodeinterpreter-ds-33b               0.685     0.730  1225.049
   4  white-rabbit-neo-33b-v1                  0.669     0.694  1191.603
   5  opencodeinterpreter-ds-6.7b              0.664     0.692  1193.917
   6  deepseek-coder-6.7b-instruct             0.656     0.665  1167.416
   7  xwincoder-34b                            0.648     0.651  1156.001
   8  bigcode--starcoder2-15b-instruct-v0.1    0.651     0.648  1149.482
   9  HuggingFaceH4--starchat2-15b-v0.1        0.646     0.639  1145.369
  10  mixtral-8x22b-instruct-v0.1              0.643     0.631  1138.261
  11  starcoder2-15b-oci                       0.632     0.618  1128.275
  12  CohereForAI--c4ai-command-r-plus         0.635     0.613  1125.543
  13  speechless-starcoder2-15b                0.624     0.598  1116.710
  14  Qwen--Qwen1.5-72B-Chat                   0.616     0.579  1100.714
  15  deepseek-coder-6.7b-base                 0.587     0.526  1056.786
  16  codegemma-7b-it                          0.569     0.490  1030.377
  17  speechless-starcoder2-7b                 0.563     0.465  1014.060
  18  databricks--dbrx-instruct                0.558     0.464  1004.553
  19  microsoft--Phi-3-mini-4k-instruct        0.542     0.439   982.766
  20  codegemma-7b                             0.524     0.412   966.646
  21  mixtral-8x7b-instruct                    0.497     0.367   926.871
  22  octocoder                                0.497     0.363   929.118
  23  codegemma-2b                             0.466     0.323   893.445
  24  open-hermes-2.5-code-290k-13b            0.458     0.311   880.735
  25  gemma-1.1-7b-it                          0.450     0.289   865.755
  26  starcoder2-3b                            0.439     0.276   851.507
  27  gemma-7b                                 0.434     0.272   848.045
  28  codegen-6b                               0.429     0.261   837.629
  29  mistral-7b                               0.421     0.242   820.887
  30  codet5p-2b                               0.381     0.198   774.521
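How the Elo column was fit is not recorded. A minimal sketch of one standard approach (logistic Elo with repeated sequential updates over the game records); the K-factor, round count, and 1000-point base are my assumptions, not the table's actual fitting procedure:

```python
def expected_score(r_a, r_b):
    """Logistic expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(games, k=4.0, rounds=100, base=1000.0):
    """Fit Elo ratings from (winner, loser) records by replaying
    the games repeatedly with standard Elo updates."""
    players = {p for game in games for p in game}
    ratings = {p: base for p in players}
    for _ in range(rounds):
        for w, l in games:
            e = expected_score(ratings[w], ratings[l])
            ratings[w] += k * (1.0 - e)   # winner gains
            ratings[l] -= k * (1.0 - e)   # loser pays symmetrically
    return ratings

# toy example: a beats b in 8 of 10 pairwise games
games = [("a", "b")] * 8 + [("b", "a")] * 2
r = fit_elo(games)
print(r["a"] > r["b"])  # True
```

Because each update is zero-sum, the ratings stay centered on the base value; an 80% win rate corresponds to roughly a 240-point gap at convergence, which is in line with the spread between the top and mid-table models above.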